Groenenboomj/fixes causal #575

groenenboomj · 2024-05-08T15:44:53Z

Add support for running with causal masking disabled (causal=False), plus some testing changes:
-- backward kernel support for bf16
-- backward benchmark
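
For context, the causal toggle decides whether a query position may attend to later key positions. A minimal pure-Python reference, not the Triton kernel in this PR, sketches the two modes. The name `attention_ref`, the list-of-rows layout, and the top-left mask alignment (query i sees keys 0..i) are assumptions made for illustration.

```python
import math

def attention_ref(q, k, v, causal=False, sm_scale=1.0):
    """Naive single-head attention over lists of row vectors (hypothetical helper)."""
    out = []
    for i, q_row in enumerate(q):
        # With causal=True, query i only sees keys 0..i; otherwise it sees all keys.
        n_keys = min(i + 1, len(k)) if causal else len(k)
        scores = [sm_scale * sum(a * b for a, b in zip(q_row, k[j]))
                  for j in range(n_keys)]
        # Numerically stable softmax over the visible keys.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) / denom
                    for c in range(len(v[0]))])
    return out
```

A reference like this could serve as the expected value when comparing kernel output for both settings of the toggle; the last query row is identical in both modes because it sees every key either way.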

xinyazhang

LGTM

Add Perf Kernels This is a combination of 2 commits. Add Perf Kernels Add Perf Kernels This is a combination of 6 commits. add perf-kernels fix formating issues fix unused variables and other bugs fix other issues remove scripts save check changes format save save try pre-commit check save

Change all block pointers to tensor pointers

Block pointers are for NVIDIA TMAs. They are useful for regular loads as well, but are not well supported. Also cleaned up some code I came across along the way and updated the comment at the top.

Add support for layouts commonly used by users (bshd). Add an option for the varlen / thd layout to specify equal context lengths for all batches, which users also often need.
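
As an illustration of the equal-context-length option: in a packed varlen / thd layout, each sequence is usually located by cumulative token offsets. When every batch entry has the same length, those offsets are just multiples of that length. The helper names below (`equal_cu_seqlens`, `thd_slice`) and the `cu_seqlens` naming are assumptions for this sketch, not taken from the PR.

```python
def equal_cu_seqlens(batch, n_ctx):
    """Cumulative sequence offsets when all `batch` sequences have length `n_ctx`."""
    return [i * n_ctx for i in range(batch + 1)]

def thd_slice(cu_seqlens, b):
    """Return the [start, end) token range of sequence `b` in the packed layout."""
    return cu_seqlens[b], cu_seqlens[b + 1]
```

For example, three sequences of four tokens each would occupy token ranges [0, 4), [4, 8) and [8, 12) along the packed dimension.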

* remove on push for Integration Tests
* rename
* add post merge test
* save
* dtype params
* skip bad config
* fix more stuff

Increase CI timeout

Couple of FA optimizations

Set SM scale multiplication to a constexpr. Minor asm improvement. Changed the acc scaling for the softmax normalization from a division to a multiplication by the reciprocal. ~10% perf improvement.

Co-authored-by: Michael Melesse
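
The division-to-reciprocal change can be sketched in plain Python as follows. This is an illustration of the idea only, not the kernel code; the names `acc`, `l_i`, and both function names are assumptions. Each row of the accumulator is normalized by its softmax denominator, and the second variant performs one division per row instead of one per element.

```python
def normalize_by_division(acc, l_i):
    """Divide every accumulator element by its row's softmax denominator."""
    return [[x / l for x in row] for row, l in zip(acc, l_i)]

def normalize_by_reciprocal(acc, l_i):
    """Compute the reciprocal once per row, then multiply each element by it."""
    out = []
    for row, l in zip(acc, l_i):
        l_recip = 1.0 / l  # single division per row
        out.append([x * l_recip for x in row])
    return out
```

The two variants agree up to floating-point rounding, so a comparison between them should use a tolerance rather than exact equality.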

* streamk v0.1
* remove unused variable
* fix format issues
* add README
* fix format issue
* change num_sms to num_cus

* Add explicit multiply-reduce GEMM kernel
* Remove `SPLIT_K` argument from kernel
* Remove `GROUP_SIZE_M` argument from kernel
* Remove conditional call to `tl.dot` from kernel
* Remove table with performance data from README
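
As a rough illustration of what "multiply-reduce" means here: instead of calling a dot-product primitive, the elementwise products along K are formed explicitly and then summed. The pure-Python sketch below shows only the arithmetic; the function name is hypothetical, and how the actual Triton kernel expresses this (presumably a broadcast multiply followed by a reduction rather than `tl.dot`) is an assumption.

```python
def gemm_multiply_reduce(a, b):
    """C = A @ B computed as explicit elementwise products reduced over K."""
    m, k, n = len(a), len(b), len(b[0])
    # products[i][j] holds the K elementwise products a[i][p] * b[p][j].
    products = [[[a[i][p] * b[p][j] for p in range(k)]
                 for j in range(n)]
                for i in range(m)]
    # Reduce (sum) over the K axis to obtain each output element.
    return [[sum(products[i][j]) for j in range(n)] for i in range(m)]
```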

micmelesse · 2024-08-13T17:02:19Z

python/perf-kernels/flash-attention.py

@@ -1259,99 +1261,93 @@ def test_op_varlen_mqa_fwd(Z, HQ, HK, N_CTX, D_HEAD, causal, dtype=torch.float16
    #(1, 16, 8192, 63),


We should also test small seqlens: 1, 2, 4, 16, 32, 64, 128, 256, etc.
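
One possible way to generate such configs is sketched below. The sequence lengths are copied from the comment above; the helper name `small_seqlen_configs` and the default values for Z, HQ, HK and D_HEAD are placeholders, not values from the PR.

```python
# Sequence lengths copied from the review comment.
SMALL_SEQLENS = [1, 2, 4, 16, 32, 64, 128, 256]

def small_seqlen_configs(z=1, hq=16, hk=16, d_head=64):
    """Build (Z, HQ, HK, N_CTX, D_HEAD) tuples for short sequences.

    The default sizes are placeholder assumptions for illustration.
    """
    return [(z, hq, hk, n_ctx, d_head) for n_ctx in SMALL_SEQLENS]
```

The resulting list could be passed to `pytest.mark.parametrize('Z, HQ, HK, N_CTX, D_HEAD', small_seqlen_configs())` on a test such as test_op_varlen_mqa_fwd.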

groenenboomj requested review from vgokhale, micmelesse and xinyazhang May 8, 2024 15:45

xinyazhang approved these changes May 8, 2024


vgokhale approved these changes May 13, 2024


micmelesse and others added 11 commits July 17, 2024 05:04

skip backward (#586)

17575ea

Change all block pointers to tensor pointers (#585)

a3d784a


Add support for bshd layout (#587)

aa6685a


Post-Merge CI (#612)

dbe1173


Increase CI timeout (#615)

23ba546


streamk v0.1 (#619)

52a908f


Add explicit multiply-reduce GEMM kernel (#621)

1d2e066


Add support for causal masking as a toggle and more datatype support

51d0d92

Unify with new forward tests and set num_stages

ae4633c

groenenboomj force-pushed the groenenboomj/fixes_causal branch from fa2e4e7 to ae4633c on August 12, 2024 16:40

groenenboomj closed this Aug 12, 2024

groenenboomj reopened this Aug 12, 2024

groenenboomj changed the base branch from triton-mlir to main_perf August 12, 2024 16:42

revert changes to tutorial kernel

550f395

micmelesse reviewed Aug 13, 2024


micmelesse force-pushed the main_perf branch from 16b0bbf to 628e09b on October 28, 2024 15:11
